Japanese Word Segmentation Using Similarity Measure for IR
Abstract
It remains an open question what are the best units for IR, in particular for Asian languages: words, phrases, bigrams, or n-grams. Our proposal is that the best units are what maximize a similarity measure between a query and a document. That is, in this framework, the IR system should have different representations of a query for each document. We develop the method which segments a query into unigrams, bigrams, and arbitrary-length n-grams, using a similarity measure such as tf-idf as the criteria for the segmentation. Experimental results show that the method takes advantage of technical terms, which tend to be longer than bigrams, and integrates the advantages of the word-based method and the n-gram-based method without drawbacks of both.
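The idea of segmenting a query differently for each document can be illustrated with a minimal sketch. The abstract does not give the actual scoring function or search procedure, so everything below is an assumption: the function names (`tf`, `idf`, `segment_query`), the smoothed idf, the length-weighted tf-idf unit score, the dynamic-programming search, and the toy corpus are all hypothetical and are not taken from the paper.

```python
import math

def tf(term, doc):
    # Raw term frequency: non-overlapping occurrences of the n-gram in the document.
    return doc.count(term)

def idf(term, corpus):
    # Smoothed inverse document frequency over a list of document strings.
    # (Assumed formula; an n-gram found in every document gets weight 0.)
    df = sum(1 for d in corpus if term in d)
    return math.log((len(corpus) + 1) / (df + 1))

def segment_query(query, doc, corpus):
    """Split `query` into contiguous character n-grams that maximize a summed
    score against `doc`. The unit score (length * tf * idf) is a hypothetical
    stand-in for the similarity measure, not the paper's definition."""
    n = len(query)
    best = [0.0] + [-math.inf] * n   # best[i]: best total score for query[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last unit in query[:i]
    for end in range(1, n + 1):
        for start in range(end):
            term = query[start:end]
            score = best[start] + len(term) * tf(term, doc) * idf(term, corpus)
            if score > best[end]:
                best[end], back[end] = score, start
    # Walk the back-pointers to recover the chosen units in order.
    units, end = [], n
    while end > 0:
        units.append(query[back[end]:end])
        end = back[end]
    return units[::-1]

# Toy corpus (illustrative only).
corpus = ["情報検索の評価", "情報の管理", "画像の検索"]
print(segment_query("情報検索", corpus[0], corpus))
```

Under these assumptions, the query 情報検索 is kept as a single four-character unit when scored against the first toy document, because that document contains the whole term and the term is rare in the corpus. Scoring the same query against a different document can yield a different segmentation, which is the per-document query representation the abstract describes.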
Similar resources
Chinese Word Segmentation Accuracy and Its Effects on Information Retrieval
In Chinese information retrieval (IR), word segmentation is an essential prerequisite process to break down the documents into smaller linguistic units or word segments so that they can be indexed for subsequent retrieval. Despite a host of Chinese information systems that are in existence today, little work has been done to study word segmentation accuracy and its effect on IR. This article de...
A Japanese-to-English Statistical Machine Translation System for Technical Documents
This thesis addresses a Japanese-to-English statistical machine translation (SMT) system for technical documents. Machine translation (MT) is a promising solution for growing translation needs. Japanese-to-English MT is one of the most difficult language pairs due to their large lexical and syntactic differences. This thesis work focuses on patents as the most demanded technical documents that ...
Berkeley at NTCIR-2: Chinese, Japanese, and English IR experiments
This paper reports on the work of the Berkeley group at the second NTCIR workshop on Japanese & English IR and Chinese IR. A number of runs were submitted on all subtasks in the two main tasks. Our main focus on the Japanese monolingual subtask was on comparing the retrieval effectiveness of different segmentation methods. The experimental results show the bigram indexing outperformed the word-base...
Cohesion and Collocation: Using Context Vectors in Text Segmentation
Collocational word similarity is considered a source of text cohesion that is hard to measure and quantify. The work presented here explores the use of information from a training corpus in measuring word similarity and evaluates the method in the text segmentation task. An implementation, the VecTile system, produces similarity curves over texts using pre-compiled vector representations of the...
Strategies of Processing Japanese Names and Character Variants in Traditional Chinese Text
This paper proposes an approach to identify word candidates that are not Traditional Chinese, including Japanese names (written in Japanese Kanji or Traditional Chinese characters) and word variants, when doing word segmentation on Traditional Chinese text. When handling personal names, a probability model concerning formats of names is introduced. We also propose a method to map Japanese Kanji...